library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(plotly)
## Loading required package: ggplot2
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:plotly':
## 
##     select
## The following object is masked from 'package:dplyr':
## 
##     select
library(e1071)
library(igraph)
## 
## Attaching package: 'igraph'
## The following object is masked from 'package:plotly':
## 
##     groups
## The following objects are masked from 'package:dplyr':
## 
##     as_data_frame, groups, union
## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum
## The following object is masked from 'package:base':
## 
##     union
library(xgboost) 
## 
## Attaching package: 'xgboost'
## The following object is masked from 'package:plotly':
## 
##     slice
## The following object is masked from 'package:dplyr':
## 
##     slice
library(archdata) 
library(caret) 
## Loading required package: lattice
library(Ckmeans.1d.dp) 
library(magrittr)

Part 1

Let \(C\) represent the classification variable and \(c\) be a value of \(C\); let us consider just two classes \(\{-,+\}\). According to Bayes' rule, the probability of a sample \(\mathbf{X}=(x_1,x_2,\dots,x_n)\) being of class \(c\) is: \[p(c\mid\mathbf{X})=\frac{p(\mathbf{X}\mid c)p(c)}{p(\mathbf{X})}\] \(\mathbf{X}\) is classified as the class \(C=-\) if and only if \[f_b(\mathbf{X})=\frac{p(C=+\mid\mathbf{X})}{p(C=-\mid\mathbf{X})}\leq1\] where \(f_b(\mathbf{X})\) is called a Bayesian classifier. The naive Bayes classifier is a particular Bayesian classifier that simplifies learning by assuming that the features are independent given the class: \[p(\mathbf{x}\mid c)=\prod_{i=1}^n p(x_i\mid c)\] where \(\mathbf{x}=(x_1,x_2,\dots,x_n)\) is a feature vector and \(c\) is the class. One surprising aspect of this classifier is that it outperforms many more sophisticated methods, despite its assumption of conditional independence, which rarely holds in real situations.
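As a minimal numerical sketch of this factorization (all probabilities below are invented for illustration, not taken from any dataset):

```r
# Toy naive Bayes ratio f_b(X) for two binary features under the
# independence assumption. All probabilities are hypothetical.
p_pos <- 0.5; p_neg <- 0.5          # class priors p(+), p(-)
p_x1_pos <- 0.8; p_x1_neg <- 0.3    # P(x1 = 1 | +), P(x1 = 1 | -)
p_x2_pos <- 0.6; p_x2_neg <- 0.4    # P(x2 = 1 | +), P(x2 = 1 | -)

# ratio for the observation x = (1, 1): product of per-feature terms
f_nb <- (p_pos * p_x1_pos * p_x2_pos) / (p_neg * p_x1_neg * p_x2_neg)
f_nb   # 4 > 1, so x is classified as '+'
```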

One explanation of this behaviour was given by Domingos and Pazzani (1997): although the Bayesian classifier's probability estimates are only optimal under quadratic loss if the independence assumption holds, the classifier can be optimal under zero-one loss (misclassification rate) even when this assumption is violated. Zero-one loss does not penalize poor probability estimation as long as the maximum probability is assigned to the correct class, so violations of independence may change the posterior probability estimates of each class without changing which class has the maximum (posterior) probability. They also showed that the region of quadratic-loss optimality of the Bayesian classifier is a second-order infinitesimal fraction of the region of zero-one optimality, implying that this classifier is a good learner over a much wider range of situations than previously thought. For example, suppose the true probabilities are \(p(+|E)=0.9\) and \(p(-|E)=0.1\), while the naive Bayes classifier produces the estimates \(\hat{p}(+|E)=0.6\) and \(\hat{p}(-|E)=0.4\). The probability estimates are obviously poor, but the classification (positive) is not affected.
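The example can be checked directly (numbers from the text): zero-one loss only depends on which class receives the maximum probability.

```r
# Crude probability estimates still pick the same class as the true
# posteriors, so the zero-one loss is unchanged (numbers from the text).
true_post <- c(pos = 0.9, neg = 0.1)   # true posterior probabilities
est_post  <- c(pos = 0.6, neg = 0.4)   # naive Bayes estimates
names(which.max(true_post)) == names(which.max(est_post))   # TRUE
```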

Zhang (2004) argued that the explanation given by Domingos and Pazzani was not sufficient to explain why dependencies among attributes do not make the classification less accurate. He provides an explanation of the behaviour of this classifier in terms of the dependencies among attributes: when they work together, and when they cancel each other out and do not affect the classification. Zhang uses a DAG (directed acyclic graph) to explicitly represent the dependencies among attributes, in which the class node is directly linked to the feature nodes and there are also links between feature nodes (ANB, Augmented Naive Bayes), as follows:

# Example ANB structure: node 1 is the class, nodes 2-5 are features;
# besides the class->feature arcs there are arcs between features
edges <- c(1,2, 1,3, 1,4, 1,5, 4,5, 2,3, 4,3)
g <- graph(edges, n = max(edges), directed = TRUE)
plot(g)

This graph represents the following joint probability distribution \[p_G(x_1,...,x_n,c) = P(c)\prod_{i=1}^{n} P(x_i\mid pa(x_i),c)\]

where \(pa(x_i)\) denotes the parents of the node \(x_i\) in the graph, and \(c\) is the root (class) node. Given just two classes \(\{-,+\}\), for a node \(x\) its local dependence can be measured as the ratio between the conditional probability of the node given its parents and the root, over the conditional probability of the node conditioned on the class node:

\[dd_G^+(x\mid pa(x))=\frac{p(x\mid pa(x), +)}{p(x\mid +)}\] \[dd_G^-(x\mid pa(x))=\frac{p(x\mid pa(x), -)}{p(x\mid -)}\]

Taking the ratio one can quantify the influence of \(X's\) local dependence on the classification: \[ddr_G(x)= \frac{dd_{G}^+(x \mid pa(x))}{dd_{G}^-(x \mid pa(x))}\]

The following results hold:

  1. When \(X\) has no parent, \(ddr_G(x)=1\), because \(dd_G^+=dd_G^-=1\).

  2. When the local dependencies in the two classes support different classifications, they partially cancel each other out, and the classification supported is the class with the greater local dependence. The ratio of the local dependencies therefore ultimately determines which classification the local dependence of a node supports.

Given an ANB graph and its correspondent naive Bayes \(G_{nb}\) (removing all arcs between features in the ANB graph), it is true that \[f_b(x_1, x_2, ...,x_n)=f_{nb}(x_1, x_2, ..., x_n) \cdot \prod_{i=1}^n ddr_G(x_i)\]

where \(f_b\) and \(f_{nb}\) are the Bayesian and naive Bayesian classifiers respectively, and \(\displaystyle \prod_{i=1}^n ddr_G(x_i)\) is the dependence distribution factor. Under zero-one loss, \(f_b\) and \(f_{nb}\) yield the same classification if and only if either \(f_b(x_1, x_2, \dots, x_n) \ge 1\) and \(\displaystyle \prod_{i=1}^n ddr_G(x_i) \le f_b(x_1, x_2, \dots, x_n)\), or \(f_b(x_1, x_2, \dots, x_n) < 1\) and \(\displaystyle\prod_{i=1}^n ddr_G(x_i) > f_b(x_1, x_2, \dots, x_n)\).
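As a sanity check, the identity can be verified numerically on a toy ANB whose only augmenting arc is \(x_1 \to x_2\) (all probabilities below are invented for illustration):

```r
# Numerical check of f_b = f_nb * prod ddr_G(x_i) for the observation
# x = (x1 = 1, x2 = 1); x2 depends on its parent x1 within each class.
p_c  <- c("+" = 0.6, "-" = 0.4)            # class priors
p_x1 <- c("+" = 0.7, "-" = 0.2)            # P(x1 = 1 | c)
p_x2_given_x1     <- c("+" = 0.9, "-" = 0.5)  # P(x2 = 1 | x1 = 1, c)
p_x2_given_not_x1 <- c("+" = 0.3, "-" = 0.6)  # P(x2 = 1 | x1 = 0, c)

# marginal conditional P(x2 = 1 | c), used by the naive factorization
p_x2 <- p_x1 * p_x2_given_x1 + (1 - p_x1) * p_x2_given_not_x1

f_b  <- (p_c["+"] * p_x1["+"] * p_x2_given_x1["+"]) /
        (p_c["-"] * p_x1["-"] * p_x2_given_x1["-"])
f_nb <- (p_c["+"] * p_x1["+"] * p_x2["+"]) /
        (p_c["-"] * p_x1["-"] * p_x2["-"])
ddr_x1 <- 1                                # x1 has no parent
ddr_x2 <- (p_x2_given_x1["+"] / p_x2["+"]) / (p_x2_given_x1["-"] / p_x2["-"])

all.equal(unname(f_b), unname(f_nb * ddr_x1 * ddr_x2))   # TRUE
```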

References:

1. Domingos, P., and Pazzani, M. 1997. Beyond independence: Conditions for the optimality of the simple Bayesian classifier. [online] Available at: https://www.ics.uci.edu/~pazzani/Publications/MLJ97.pdf

2. Zhang, H. 2004. The Optimality of Naive Bayes. American Association for Artificial Intelligence. [online] Available at: http://www.cs.unb.ca/~hzhang/publications/FLAIRS04ZhangH.pdf

Part 2

The main goal of this part is to analyze a dataset with only two outcomes and try to implement two different binary classification models, the LDA classifier and the Naive Bayes classifier.

load('daily-sport.RData')

Take a look at the data

# Information about dataset
str(dailysport)
## 'data.frame':    30000 obs. of  46 variables:
##  $ id      : Factor w/ 4 levels "crosstr","jumping",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ T-xAcc  : num  8.72 8.62 8.93 9.07 9.05 ...
##  $ T-yAcc  : num  2.14 2.05 2.05 2.08 1.98 ...
##  $ T-zAcc  : num  3.83 3.45 3.28 3.23 3.41 ...
##  $ T-xGyro : num  0.294 0.23 0.209 0.197 0.238 ...
##  $ T-yGyro : num  -0.252 -0.26 -0.344 -0.241 -0.098 ...
##  $ T-zGyro : num  -0.1506 -0.13245 -0.10109 -0.06141 -0.00907 ...
##  $ T-xMag  : num  -0.577 -0.586 -0.596 -0.605 -0.61 ...
##  $ T-yMag  : num  0.1086 0.0974 0.0896 0.0825 0.075 ...
##  $ T-zMag  : num  -0.697 -0.69 -0.684 -0.675 -0.672 ...
##  $ RA-xAcc : num  8.59 8.5 9.09 9.36 9.34 ...
##  $ RA-yAcc : num  3.51 3.77 3.8 3.77 3.94 ...
##  $ RA-zAcc : num  1.58 1.64 1.62 1.68 1.83 ...
##  $ RA-xGyro: num  0.492 0.252 0.168 0.25 0.269 ...
##  $ RA-yGyro: num  -0.224 -0.381 -0.421 -0.352 -0.276 ...
##  $ RA-zGyro: num  0.532 0.65 0.706 0.597 0.373 ...
##  $ RA-xMag : num  -0.532 -0.55 -0.574 -0.594 -0.614 ...
##  $ RA-yMag : num  -0.476 -0.472 -0.461 -0.452 -0.443 ...
##  $ RA-zMag : num  -0.594 -0.582 -0.571 -0.559 -0.545 ...
##  $ LA-xAcc : num  9.22 8.47 8.49 8.43 8.08 ...
##  $ LA-yAcc : num  -2.77 -2.76 -3.02 -3.07 -3.59 ...
##  $ LA-zAcc : num  3.06 2.72 2.51 2.54 3.06 ...
##  $ LA-xGyro: num  0.301 0.0294 -0.0944 -0.6035 -1.0688 ...
##  $ LA-yGyro: num  -0.0141 -0.0514 -0.0233 0.1012 0.1644 ...
##  $ LA-zGyro: num  -0.488 -0.379 -0.342 -0.325 -0.265 ...
##  $ LA-xMag : num  -0.484 -0.492 -0.503 -0.511 -0.52 ...
##  $ LA-yMag : num  0.727 0.717 0.712 0.709 0.711 ...
##  $ LA-zMag : num  -0.256 -0.262 -0.261 -0.251 -0.229 ...
##  $ RL-xAcc : num  -11.65 -10.9 -10.09 -8.85 -8.46 ...
##  $ RL-yAcc : num  0.112 -1.163 -1.985 -0.721 0.802 ...
##  $ RL-zAcc : num  0.464 1.118 1.285 1.398 1.577 ...
##  $ RL-xGyro: num  -1.69 -1.268 -0.476 -0.305 -0.278 ...
##  $ RL-yGyro: num  0.411 0.437 0.148 0.195 0.235 ...
##  $ RL-zGyro: num  1.55 1.68 1.57 1.55 1.36 ...
##  $ RL-xMag : num  0.794 0.763 0.727 0.689 0.646 ...
##  $ RL-yMag : num  -0.427 -0.485 -0.54 -0.587 -0.636 ...
##  $ RL-zMag : num  0.192 0.177 0.171 0.166 0.164 ...
##  $ LL-xAcc : num  -9.74 -9.53 -9.97 -9.66 -9.77 ...
##  $ LL-yAcc : num  -0.434 -0.495 1.045 -0.436 -0.523 ...
##  $ LL-zAcc : num  -1.46 -1.55 -1.27 -1.04 -1.66 ...
##  $ LL-xGyro: num  -0.291 -0.348 -0.417 -0.234 -0.344 ...
##  $ LL-yGyro: num  0.0618 0.0619 0.0789 -0.0764 0.0258 ...
##  $ LL-zGyro: num  0.32 0.291 0.297 0.478 0.407 ...
##  $ LL-xMag : num  0.8 0.805 0.811 0.817 0.823 ...
##  $ LL-yMag : num  0.437 0.43 0.423 0.412 0.397 ...
##  $ LL-zMag : num  -0.166 -0.159 -0.149 -0.144 -0.143 ...
summary(dailysport)
##        id           T-xAcc            T-yAcc             T-zAcc      
##  crosstr:7500   Min.   :-11.526   Min.   :-14.0190   Min.   :-8.363  
##  jumping:7500   1st Qu.:  6.584   1st Qu.: -0.3961   1st Qu.: 1.420  
##  stepper:7500   Median :  8.751   Median :  0.3336   Median : 2.848  
##  walking:7500   Mean   :  9.227   Mean   :  0.3792   Mean   : 3.029  
##                 3rd Qu.: 10.936   3rd Qu.:  1.3094   3rd Qu.: 3.972  
##                 Max.   : 70.835   Max.   : 14.0120   Max.   :33.093  
##     T-xGyro             T-yGyro            T-zGyro          
##  Min.   :-4.696900   Min.   :-9.85770   Min.   :-2.1615000  
##  1st Qu.:-0.330675   1st Qu.:-0.26165   1st Qu.:-0.1770200  
##  Median : 0.016853   Median : 0.02464   Median :-0.0009475  
##  Mean   : 0.001798   Mean   : 0.01580   Mean   : 0.0014212  
##  3rd Qu.: 0.359052   3rd Qu.: 0.33622   3rd Qu.: 0.1753525  
##  Max.   : 4.530300   Max.   : 5.59380   Max.   : 1.9199000  
##      T-xMag            T-yMag             T-zMag           RA-xAcc        
##  Min.   :-0.9246   Min.   :-0.58331   Min.   :-0.8606   Min.   :-25.2760  
##  1st Qu.:-0.7513   1st Qu.:-0.15989   1st Qu.:-0.6281   1st Qu.: -0.1473  
##  Median :-0.6952   Median :-0.06689   Median :-0.4968   Median :  2.7679  
##  Mean   :-0.7000   Mean   : 0.02185   Mean   :-0.4154   Mean   :  3.1426  
##  3rd Qu.:-0.6171   3rd Qu.: 0.25817   3rd Qu.:-0.2799   3rd Qu.:  7.7249  
##  Max.   :-0.4142   Max.   : 0.56106   Max.   : 0.2717   Max.   : 21.5890  
##     RA-yAcc           RA-zAcc           RA-xGyro       
##  Min.   :-10.587   Min.   :-16.557   Min.   :-7.75480  
##  1st Qu.:  2.839   1st Qu.:  1.847   1st Qu.:-0.43318  
##  Median :  4.864   Median :  3.967   Median : 0.01838  
##  Mean   :  6.086   Mean   :  4.166   Mean   : 0.01600  
##  3rd Qu.:  7.529   3rd Qu.:  6.468   3rd Qu.: 0.47524  
##  Max.   : 51.797   Max.   : 39.010   Max.   : 9.55990  
##     RA-yGyro            RA-zGyro            RA-xMag        
##  Min.   :-6.432800   Min.   :-4.343000   Min.   :-0.94995  
##  1st Qu.:-0.365402   1st Qu.:-0.654562   1st Qu.:-0.46215  
##  Median :-0.014138   Median :-0.017570   Median :-0.22478  
##  Mean   :-0.009334   Mean   : 0.002844   Mean   :-0.25362  
##  3rd Qu.: 0.362628   3rd Qu.: 0.635118   3rd Qu.:-0.01952  
##  Max.   : 4.853700   Max.   : 5.177800   Max.   : 0.93707  
##     RA-yMag           RA-zMag           LA-xAcc            LA-yAcc       
##  Min.   :-1.1059   Min.   :-1.1374   Min.   :-31.6400   Min.   :-57.939  
##  1st Qu.:-0.7037   1st Qu.:-0.6716   1st Qu.:  0.6757   1st Qu.: -6.833  
##  Median :-0.5470   Median :-0.5559   Median :  4.3583   Median : -4.362  
##  Mean   :-0.4846   Mean   :-0.5029   Mean   :  4.0236   Mean   : -5.387  
##  3rd Qu.:-0.2958   3rd Qu.:-0.3964   3rd Qu.:  8.1350   3rd Qu.: -2.387  
##  Max.   : 0.3295   Max.   : 0.3614   Max.   : 50.9240   Max.   : 30.246  
##     LA-zAcc            LA-xGyro            LA-yGyro        
##  Min.   :-23.7440   Min.   :-8.833200   Min.   :-8.381800  
##  1st Qu.:  0.7575   1st Qu.:-0.469542   1st Qu.:-0.382228  
##  Median :  3.2938   Median :-0.004280   Median :-0.024567  
##  Mean   :  2.9418   Mean   :-0.007128   Mean   :-0.004787  
##  3rd Qu.:  6.2104   3rd Qu.: 0.426720   3rd Qu.: 0.368470  
##  Max.   : 24.6540   Max.   :18.809000   Max.   : 4.358900  
##     LA-zGyro             LA-xMag            LA-yMag       
##  Min.   :-12.935000   Min.   :-0.93608   Min.   :-0.9267  
##  1st Qu.: -0.624395   1st Qu.:-0.49630   1st Qu.: 0.3048  
##  Median :  0.027278   Median :-0.03208   Median : 0.4356  
##  Mean   : -0.008746   Mean   :-0.14015   Mean   : 0.3854  
##  3rd Qu.:  0.622652   3rd Qu.: 0.21392   3rd Qu.: 0.5705  
##  Max.   :  5.679100   Max.   : 0.89720   Max.   : 0.8499  
##     LA-zMag           RL-xAcc           RL-yAcc           RL-zAcc        
##  Min.   :-1.1028   Min.   :-89.704   Min.   :-52.727   Min.   :-47.9250  
##  1st Qu.:-0.6849   1st Qu.:-11.917   1st Qu.: -1.034   1st Qu.: -1.1442  
##  Median :-0.4257   Median : -9.704   Median :  1.115   Median : -0.1701  
##  Mean   :-0.2640   Mean   :-10.051   Mean   :  1.084   Mean   : -0.2483  
##  3rd Qu.: 0.1127   3rd Qu.: -7.624   3rd Qu.:  3.362   3rd Qu.:  0.8341  
##  Max.   : 0.7589   Max.   :  3.008   Max.   : 80.427   Max.   : 17.0360  
##     RL-xGyro           RL-yGyro            RL-zGyro        
##  Min.   :-5.14820   Min.   :-2.081100   Min.   :-3.185800  
##  1st Qu.:-0.52341   1st Qu.:-0.198860   1st Qu.:-1.127500  
##  Median : 0.04027   Median : 0.008418   Median :-0.064018  
##  Mean   : 0.01691   Mean   : 0.012355   Mean   :-0.003762  
##  3rd Qu.: 0.54979   3rd Qu.: 0.219955   3rd Qu.: 1.074200  
##  Max.   : 6.01390   Max.   : 3.311100   Max.   : 3.665900  
##     RL-xMag          RL-yMag           RL-zMag            LL-xAcc       
##  Min.   :0.3556   Min.   :-0.7856   Min.   :-0.62804   Min.   :-93.940  
##  1st Qu.:0.4882   1st Qu.:-0.2595   1st Qu.:-0.32134   1st Qu.:-11.757  
##  Median :0.6394   Median :-0.1207   Median :-0.12695   Median : -9.556  
##  Mean   :0.6510   Mean   :-0.1197   Mean   :-0.01181   Mean   : -9.989  
##  3rd Qu.:0.8168   3rd Qu.:-0.0391   3rd Qu.: 0.31716   3rd Qu.: -7.699  
##  Max.   :1.0715   Max.   : 0.7485   Max.   : 0.57962   Max.   :  3.592  
##     LL-yAcc            LL-zAcc            LL-xGyro       
##  Min.   :-95.8980   Min.   :-27.1510   Min.   :-7.15920  
##  1st Qu.: -4.1598   1st Qu.: -1.4636   1st Qu.:-0.56632  
##  Median : -1.7903   Median : -0.3309   Median :-0.06929  
##  Mean   : -1.7952   Mean   : -0.4361   Mean   :-0.03791  
##  3rd Qu.:  0.4201   3rd Qu.:  0.7115   3rd Qu.: 0.48713  
##  Max.   : 49.4040   Max.   : 33.0620   Max.   :10.38100  
##     LL-yGyro            LL-zGyro            LL-xMag      
##  Min.   :-3.605000   Min.   :-4.115000   Min.   :0.2951  
##  1st Qu.:-0.227410   1st Qu.:-1.116200   1st Qu.:0.4771  
##  Median : 0.004304   Median : 0.018875   Median :0.6287  
##  Mean   : 0.004683   Mean   :-0.006432   Mean   :0.6244  
##  3rd Qu.: 0.231900   3rd Qu.: 1.148025   3rd Qu.:0.7691  
##  Max.   : 4.207600   Max.   : 3.526500   Max.   :1.0377  
##     LL-yMag           LL-zMag        
##  Min.   :-0.7177   Min.   :-0.52303  
##  1st Qu.: 0.1047   1st Qu.:-0.27879  
##  Median : 0.2789   Median : 0.01363  
##  Mean   : 0.2626   Mean   :-0.02575  
##  3rd Qu.: 0.4749   3rd Qu.: 0.14591  
##  Max.   : 1.0249   Max.   : 0.62156

The dataset contains 30000 observations of 46 variables. Forty-five of them are measurements from 9 sensor channels (x, y, z accelerometers; x, y, z gyroscopes; x, y, z magnetometers) placed at 5 body locations: torso (T), right arm (RA), left arm (LA), right leg (RL) and left leg (LL) of one person performing 4 different activities (walking, stepper, cross trainer, jumping). The 46th variable is a factor identifying the activity (7500 observations per activity).

Check whether there are any NA values in the dataset

sum(is.na(dailysport))
## [1] 0

There are no NA values, so it is not necessary to deal with them.

The main goal is to distinguish these human activities using the data above.

First we reduce the dimension of the dataset, keeping just 2 activities (jumping and walking) and only the 3 accelerometer channels at a single location (x, y, z accelerometers on the torso (T)).

# small dataset: two activities, torso accelerometers only
ds.small <- dailysport %>%
  subset(id == 'walking' | id == 'jumping',
         select = c('id','T-xAcc','T-yAcc','T-zAcc')) %>%
  droplevels()
colnames(ds.small) <- c('id','T_xAcc','T_yAcc','T_zAcc')

Take a look at the main properties of these three components for the two different classes

summary(ds.small %>% subset(id=='walking')%>%droplevels())
##        id           T_xAcc           T_yAcc           T_zAcc      
##  walking:7500   Min.   : 5.252   Min.   :-1.583   Min.   :0.6822  
##                 1st Qu.: 8.062   1st Qu.: 0.299   1st Qu.:3.0790  
##                 Median : 8.785   Median : 1.147   Median :3.5861  
##                 Mean   : 9.068   Mean   : 1.144   Mean   :3.4957  
##                 3rd Qu.:10.079   3rd Qu.: 1.939   3rd Qu.:4.0489  
##                 Max.   :15.455   Max.   : 4.448   Max.   :6.4017
summary(ds.small %>% subset(id=='jumping')%>%droplevels())
##        id           T_xAcc             T_yAcc             T_zAcc      
##  jumping:7500   Min.   :-11.5260   Min.   :-14.0190   Min.   :-8.363  
##                 1st Qu.: -0.8114   1st Qu.: -0.3693   1st Qu.:-0.150  
##                 Median :  3.4878   Median :  0.1510   Median : 1.659  
##                 Mean   :  9.0261   Mean   :  0.2096   Mean   : 3.638  
##                 3rd Qu.: 18.8015   3rd Qu.:  0.7745   3rd Qu.: 6.367  
##                 Max.   : 70.8350   Max.   : 14.0120   Max.   :33.093

One can see in this subset:

  1. For the walking class, ‘T_yAcc’ and ‘T_zAcc’ have nearly identical median and mean, whereas for ‘T_xAcc’ the two values differ.

  2. For the jumping class, ‘T_xAcc’ and ‘T_zAcc’ have markedly different median and mean, whereas for ‘T_yAcc’ the two values are close.
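The mean–median gaps above reflect skewness, which e1071 (already loaded) can quantify directly. A quick sketch on simulated data (the function names are real, the data here are synthetic):

```r
# A symmetric sample has mean ~ median and skewness ~ 0; a right-skewed
# sample has mean > median and positive skewness.
library(e1071)
set.seed(1)
sym <- rnorm(10000)   # symmetric sample
rsk <- rexp(10000)    # right-skewed sample
c(gap = mean(sym) - median(sym), skew = skewness(sym))
c(gap = mean(rsk) - median(rsk), skew = skewness(rsk))
```

Applied to the real features, e.g. `skewness(subset(ds.small, id == 'jumping')$T_xAcc)`, the jumping accelerometer readings should show the largest asymmetry, consistent with the summaries above.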

Let us look at these observations as a 3D point cloud in Euclidean space; to keep the plot readable we consider just 1000 random points.

set.seed(123)
# sample 1000 points from our dataset
points<-ds.small[sample(nrow(ds.small),1000),]

# 3D scatterplot
plot_ly(points, x = ~T_xAcc, y = ~T_yAcc, z = ~T_zAcc,color=~id,colors = c('green', 'purple'),marker=list(size=4)) %>%
  add_markers() %>%
  layout(scene = list(xaxis = list(title = 'x'),
                     yaxis = list(title = 'y'),
                     zaxis = list(title = 'z')))

The plot shows a high concentration of points of both classes around the origin, whereas towards \(x=50\) the points of the jumping class become more scattered. The points of the walking class form one tight cluster that sits inside the cluster formed by the jumping class, so one can guess that no linear classifier can find a plane in Euclidean space that separates the two clusters well: whatever linear classifier is used, it will not work well, as the following analysis shows. First the data are split into a train set and a test set, taking 70% and 30% of the observations respectively.

## 70% of the sample size
samp_size <- floor(0.70 * nrow(ds.small))

## set the seed to make your partition reproducible
set.seed(123)
train_ind <- sample(seq_len(nrow(ds.small)), size = samp_size)

ds.train <- ds.small[train_ind, ]
ds.test <- ds.small[-train_ind, ]

Given the two sets, we can estimate:

  1. \(f_1(\cdot)\) from the \(\mathbf{X}_i\) for which \(Y_i=1\), obtaining \(\hat{f}_1(\cdot)\)

  2. \(f_0(\cdot)\) from the \(\mathbf{X}_i\) for which \(Y_i=0\), obtaining \(\hat{f}_0(\cdot)\)

  3. the subpopulation proportion \(\pi_1\) with \(\hat{\pi}_1=\frac{1}{n}\sum_{i=1}^n Y_i\)

  4. the subpopulation proportion \(\pi_0\) with \(\hat{\pi}_0=1-\hat{\pi}_1\)

and define

\[\hat{r}(\mathbf{x})=\frac{\hat{\pi_1}\hat{f_1}(x)}{\hat{\pi_1}\hat{f_1}(x)+\hat{\pi_0}\hat{f_0}(x)}\to\hat{\eta}(\mathbf{x})=\begin{cases} 1 & \quad \text{if } \hat{r}(\mathbf{x})>\frac{1}{2}\\ 0 & \quad \text{ otherwise} \end{cases}\]
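A minimal 1-D sketch of this plug-in rule, with kernel density estimates standing in for \(\hat{f}_1\) and \(\hat{f}_0\) (the data here are synthetic Gaussians, not the sensor readings):

```r
# Plug-in classifier in one dimension on synthetic data.
set.seed(7)
x1 <- rnorm(500, mean =  2)   # features observed with Y = 1
x0 <- rnorm(500, mean = -2)   # features observed with Y = 0
# turn the kernel density estimates into callable functions
f1_hat <- approxfun(density(x1), rule = 2)
f0_hat <- approxfun(density(x0), rule = 2)
pi1_hat <- 0.5; pi0_hat <- 0.5   # balanced classes by construction
r_hat   <- function(x) pi1_hat * f1_hat(x) /
                       (pi1_hat * f1_hat(x) + pi0_hat * f0_hat(x))
eta_hat <- function(x) as.integer(r_hat(x) > 1/2)
eta_hat(c(-2, 2))   # 0 1
```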

The simplest approach consists of assuming a parametric model for the class-conditional densities \(f_1(\cdot)\) and \(f_0(\cdot)\), which leads us to the LDA model.

Linear discriminant analysis

The main assumptions are:

  1. The conditional probability density functions \(p(\mathbf{x}|y=0)\) and \(p(\mathbf{x}|y=1)\) are both normally distributed

  2. The class covariances are identical (homoscedasticity assumption)
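Assumption 2 can be eyeballed by comparing per-class covariance matrices; `by(ds.train[,-1], ds.train$id, cov)` would do this for the real features. A sketch on simulated data:

```r
# Two simulated 3-d classes with deliberately unequal spread.
set.seed(42)
g1 <- matrix(rnorm(600),         ncol = 3)   # class 1: unit variances
g2 <- matrix(rnorm(600, sd = 3), ncol = 3)   # class 2: 9x larger variances
# under homoscedasticity this ratio of per-class variances would be ~1
diag(cov(g2)) / diag(cov(g1))
```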

# LDA
lda.out<-lda(id ~ .,data=ds.train)

Having fitted the LDA model, we can predict the class labels on the training set and on the test set

# Prediction on the train set
pred.tr = predict(lda.out, ds.train[,-1])$class

# Prediction on the test set
pred.te = predict(lda.out, ds.test[,-1])$class

From the confusion matrix we can evaluate the performance of the model on the train set and on the test set by computing the empirical error rate, defined as \[\hat{L}_n(\eta)=\frac{1}{n}\sum_{i=1}^n I\left(\eta(\mathbf{X}_i)\neq Y_i\right)\]

# Empirical error rate train set
table(pred.tr , ds.train$id)
##          
## pred.tr   jumping walking
##   jumping    3647    1952
##   walking    1584    3317
mean(pred.tr != ds.train$id)
## [1] 0.3367619
# Empirical error rate test set
table(pred.te , ds.test$id)
##          
## pred.te   jumping walking
##   jumping    1608     834
##   walking     661    1397
mean(pred.te != ds.test$id)
## [1] 0.3322222

As we can see, the two error rates are very similar. One can assume that either the model underfits, since the error is high on both the train and the test set, or the train set and the test set contain very similar points, as the plot suggests: the points lie very close to each other, in particular around the origin.

Naive Bayes Classifier

Let us try a naive Bayes classifier, assuming that, conditionally on the class label, the feature vector \(\mathbf{X}\) has independent components, so \[f_1(x)=\prod_{j=1}^{d}f_{1,j}(x_j) \quad \text{and} \quad f_0(x)=\prod_{j=1}^{d}f_{0,j}(x_j)\] We apply the naiveBayes() function from the e1071 package, in which each numeric feature is assumed to follow a Gaussian distribution within each class. The same train set and test set are used

# note: 'type = "raw"' is an argument of predict(), not of naiveBayes()
naive.Bayes.out <- naiveBayes(id ~ ., data = ds.train)

Having fitted the naive Bayes model, we can predict the class labels on the training set and on the test set

# Prediction on the train set
pred.tr = predict(naive.Bayes.out, ds.train[,-1])
# Prediction on the test set
pred.te = predict(naive.Bayes.out, ds.test[,-1])

and evaluate the model using the empirical error rate

# Empirical error rate train set
table(pred.tr , ds.train$id)
##          
## pred.tr   jumping walking
##   jumping    4784      42
##   walking     447    5227
mean(pred.tr != ds.train$id)
## [1] 0.04657143
# Empirical error rate test set
table(pred.te , ds.test$id)
##          
## pred.te   jumping walking
##   jumping    2087      19
##   walking     182    2212
mean(pred.te != ds.test$id)
## [1] 0.04466667

The error rates on the train set and the test set are almost the same, but the predictions are far more accurate than LDA's: about 33% of the LDA predictions are wrong, whereas the naive Bayes classifier is wrong on only about 4% of the predictions.

We now plot the distributions estimated with a Gaussian kernel to see whether there are significant differences between the two activities for each variable. For each density we used the bootstrap to estimate a 95% confidence band, implemented as follows:

  1. we sampled from each feature with replacement;

  2. we computed the density with a Gaussian kernel at each iteration;

  3. we applied the quantile function to the bootstrap replicates to obtain the CI.

boot.funct <- function(df, class, param, dens){
  subdf <- subset(df, id == class)[, param]
  # 1000 bootstrap replicates of the kernel density, evaluated on the
  # same grid as the original density estimate 'dens'
  fit2 <- replicate(1000, {
    x <- sample(subdf, replace = TRUE)
    density(x, kernel = 'gaussian', from = min(dens$x), to = max(dens$x))$y})
  # pointwise 2.5% and 97.5% quantiles give the 95% confidence band
  fit3 <- apply(fit2, 1, quantile, c(0.025, 0.975))
  return(fit3)
}

# Density
wx<-density(subset(ds.small,id=='walking')$T_xAcc,kernel = 'gaussian')
jx<-density(subset(ds.small,id=='jumping')$T_xAcc,kernel = 'gaussian')
final_x=as.data.frame(cbind(wx$x,wx$y,jx$x,jx$y))
colnames(final_x)<-c('walk_x','walk_densx','jump_x','jump_densx')

#Bootstrap starts T_xAcc walking
fitwalk_xCI<-boot.funct(ds.small,'walking','T_xAcc',wx)

#Bootstrap starts T_xAcc jumping
fitjump_xCI<-boot.funct(ds.small,'jumping','T_xAcc',jx)

layout(matrix(c(1,1,1,1,2,3,2,3), nrow = 4, ncol = 2, byrow = TRUE))

plot(jx,ylim=c(0,max(max(wx$y),max(jx$y))),main = 'Kernel Density T_xAcc',xlab='x',col='red')
lines(wx,col='blue')
legend('topright', legend=c("Walking", "Jumping"),col=c("blue", "red"), lty=1, cex=0.8)

#Plot CI Walking x
plot(wx, ylim = range(fitwalk_xCI),main = 'CI density T_xAcc walking',xlab='x')
polygon( c(wx$x, rev(wx$x)), c(fitwalk_xCI[1,], rev(fitwalk_xCI[2,])),col='black', border=F)
lines(wx, col = "red", lwd = 2)

#Plot CI jumping x
plot(jx, ylim = range(fitjump_xCI),main = 'CI density T_xAcc jumping',xlab='x')
polygon( c(jx$x, rev(jx$x)), c(fitjump_xCI[1,], rev(fitjump_xCI[2,])),col='black', border=F)
lines(jx, col = "red", lwd = 2)

# Density
wy<-density(subset(ds.small,id=='walking')$T_yAcc,kernel = 'gaussian')
jy<-density(subset(ds.small,id=='jumping')$T_yAcc,kernel = 'gaussian')
final_y=as.data.frame(cbind(wy$x,wy$y,jy$x,jy$y))
colnames(final_y)<-c('walk_y','walk_densy','jump_y','jump_densy')

#Bootstrap starts T_yAcc walking
fitwalk_yCI<-boot.funct(ds.small,'walking','T_yAcc',wy)

#Bootstrap starts T_yAcc jumping
fitjump_yCI<-boot.funct(ds.small,'jumping','T_yAcc',jy)

layout(matrix(c(1,1,1,1,2,3,2,3), nrow = 4, ncol = 2, byrow = TRUE))

plot(jy,ylim=c(0,max(max(wy$y),max(jy$y))),main = 'Kernel Density T_yAcc',xlab='y',col='red')
lines(wy,col='blue')
legend('topright', legend=c("Walking", "Jumping"),col=c("blue", "red"), lty=1, cex=0.8)

#Plot CI Walking y
plot(wy, ylim = range(fitwalk_yCI),main = 'CI density T_yAcc walking',xlab='y')
polygon( c(wy$x, rev(wy$x)), c(fitwalk_yCI[1,], rev(fitwalk_yCI[2,])),col='black', border=F)
lines(wy, col = "red", lwd = 2)

#Plot CI jumping y
plot(jy, ylim = range(fitjump_yCI),main = 'CI density T_yAcc jumping',xlab='y')
polygon( c(jy$x, rev(jy$x)), c(fitjump_yCI[1,], rev(fitjump_yCI[2,])),col='black', border=F)
lines(jy, col = "red", lwd = 2)

wz<-density(subset(ds.small,id=='walking')$T_zAcc,kernel = 'gaussian')
jz<-density(subset(ds.small,id=='jumping')$T_zAcc,kernel = 'gaussian')
final_z=as.data.frame(cbind(wz$x,wz$y,jz$x,jz$y))
colnames(final_z)<-c('walk_z','walk_densz','jump_z','jump_densz')

#Bootstrap starts T_zAcc walking
fitwalk_zCI<-boot.funct(ds.small,'walking','T_zAcc',wz)

#Bootstrap starts T_zAcc jumping
fitjump_zCI<-boot.funct(ds.small,'jumping','T_zAcc',jz)

layout(matrix(c(1,1,1,1,2,3,2,3), nrow = 4, ncol = 2, byrow = TRUE))

plot(jz,ylim=c(0,max(max(wz$y),max(jz$y))),main = 'Kernel Density T_zAcc',xlab='z',col='red')
lines(wz,col='blue')
legend('topright', legend=c("Walking", "Jumping"),col=c("blue", "red"), lty=1, cex=0.8)

#Plot CI Walking z
plot(wz, ylim = range(fitwalk_zCI),main = 'CI density T_zAcc walking',xlab='z')
polygon( c(wz$x, rev(wz$x)), c(fitwalk_zCI[1,], rev(fitwalk_zCI[2,])),col='black', border=F)
lines(wz, col = "red", lwd = 2)

#Plot CI jumping z
plot(jz, ylim = range(fitjump_zCI),main = 'CI density T_zAcc jumping',xlab='z')
polygon( c(jz$x, rev(jz$x)), c(fitjump_zCI[1,], rev(fitjump_zCI[2,])),col='black', border=F)
lines(jz, col = "red", lwd = 2)

For the variables T_xAcc and T_zAcc the distributions of “Walking” and “Jumping” are clearly different: in both cases the “Walking” values have low variance whereas the “Jumping” values have a much larger variance, which means that each of these two variables on its own helps to distinguish the two classes. This is not the case for T_yAcc, where the two curves largely overlap.

Implementation of Naive Bayes

The idea is to use kernel density estimates instead of the Gaussian class-conditionals assumed above. The individual class-conditional densities \(\{f_{1,j}(\cdot)\}_j\) and \(\{f_{0,j}(\cdot)\}_j\) can each be estimated separately with 1-dimensional kernel density estimates and, since we assume the variables are independent given the class, the joint class-conditionals can be computed as follows:

\[f_1(x)=\prod_{j=1}^{d}f_{1,j}(x_j) \quad\ and \quad\ f_0(x)=\prod_{j=1}^{d}f_{0,j}(x_j)\]

#density: returns a function interpolating the kernel density estimate
density_covariate <- function (column) {
    den <- density(column, kernel = 'gaussian')
    funct <- approxfun(den$x, den$y)
    return (funct)
}
#Naive Bayes: classify each row via the product of 1-d kernel densities
Naive_Bayes <- function (newdataset, prior_one_class, prior_second_class, densw, densj) {
    n_hat = rep(NA, nrow(newdataset))
    for (i in 1:nrow(newdataset)){
          row = newdataset[i,]
          # unnormalized posteriors: prior times product of marginal densities
          post_1 = prior_one_class * densw[[1]](row[2]) * densw[[2]](row[3]) * densw[[3]](row[4])
          post_2 = prior_second_class * densj[[1]](row[2]) * densj[[2]](row[3]) * densj[[3]](row[4])
          r_hat = post_1/(post_1+post_2)
          # outside the support of the walking densities approxfun returns NA,
          # in which case the point is classified as jumping
          n_hat[i] = ifelse(is.na(post_1) || r_hat < 0.5, "jumping", "walking")
    }
    return (n_hat)
}

ds.w <- subset(ds.train, id=='walking',select = c('T_xAcc','T_yAcc','T_zAcc'))
ds.j <- subset(ds.train, id=='jumping',select = c('T_xAcc','T_yAcc','T_zAcc'))

dens.w<-sapply(ds.w,density_covariate)
dens.j<-sapply(ds.j,density_covariate)

prior_one_class <-  nrow(ds.w)/length(ds.train$id)
prior_second_class <- nrow(ds.j)/length(ds.train$id)


result <- Naive_Bayes (ds.test, prior_one_class, prior_second_class,dens.w,dens.j)
result2 <- Naive_Bayes (ds.train, prior_one_class, prior_second_class,dens.w,dens.j)

Let us evaluate our implementation of Naive Bayes classifier and calculate the error rate:

# Empirical error rate train set
table(result2 , ds.train$id)
##          
## result2   jumping walking
##   jumping    4800      16
##   walking     431    5253
mean(result2 != ds.train$id)
## [1] 0.04257143
# Empirical error rate test set
table(result , ds.test$id)
##          
## result    jumping walking
##   jumping    2093       6
##   walking     176    2225
mean(result != ds.test$id)
## [1] 0.04044444

Our implementation beats the parametric naive Bayes on both the train and the test set. In particular, the number of walking observations misclassified as jumping is 16 on the train set and 6 on the test set, against 42 and 19 for the previous model. The precision of this classifier is higher because the kernel approximations of the class-conditional densities are less biased than the Gaussian assumption, so the error is smaller than in the previous case.
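caret (already loaded) bundles these numbers: confusionMatrix() reports accuracy, sensitivity and specificity in one call, which makes comparisons like the one above less error-prone. A toy sketch with made-up vectors; with the objects above one would call confusionMatrix(factor(result), ds.test$id).

```r
# confusionMatrix(data, reference): first argument is the predictions,
# second is the ground truth (toy factors, not the sensor data).
library(caret)
pred  <- factor(c("jumping", "jumping", "walking", "walking", "jumping"))
truth <- factor(c("jumping", "walking", "walking", "walking", "jumping"))
cm <- confusionMatrix(pred, truth)
cm$overall["Accuracy"]   # 0.8: 4 of the 5 toy predictions are correct
```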

Part 3

Now we analyze the whole dataset and try to predict all four activities instead of just two. We use the LDA method to see whether a linear classifier can identify all the activities.

## 70% of the sample size
samp_size <- floor(0.70 * nrow(dailysport))

## set the seed to make your partition reproducible
set.seed(123)
train_ind <- sample(seq_len(nrow(dailysport)), size = samp_size)

train <- dailysport[train_ind, ]
test <- dailysport[-train_ind, ]

# LDA
lda.out<-lda(id ~ .,data=train)

# Prediction on the test set
pred.te = predict(lda.out, test[,-1])$class

# Empirical error rate test set
table(pred.te , test$id)
##          
## pred.te   crosstr jumping stepper walking
##   crosstr    2261       0       0       0
##   jumping       0    2225       0       0
##   stepper       0       0    2264       0
##   walking       0       0       0    2250
mean(pred.te != test$id)
## [1] 0

From this result we can say that the 4 groups are well separated, so a linear classifier is enough to distinguish the 4 classes. To support this conclusion, we also fit boosted trees (gradient boosting) with the xgboost package.

ds.origin <- dailysport  # keep a copy with the original factor labels

# xgboost requires 0-based integer class labels
dailysport$id <- as.numeric(dailysport$id) - 1
dailysportM <- as.matrix((dailysport[,-1]))

data_label <- dailysport[,"id"]
dailysportXGB <- xgb.DMatrix(data = dailysportM, label = data_label)  # features only, id excluded

## 70% of the sample size
samp_size <- floor(0.70 * nrow(dailysport))

## set the seed to make your partition reproducible
set.seed(123)
train_ind <- sample(seq_len(nrow(dailysport)), size = samp_size)

train <- dailysportM[train_ind, ]
train_label  <- data_label[train_ind]
train_matrix <- xgb.DMatrix(data = train, label = train_label)


test <- dailysportM[-train_ind, ]
test_label <- data_label[-train_ind]
test_matrix <- xgb.DMatrix(data = test, label = test_label)


n_class = length(unique(dailysport$id))
params = list('objective' = 'multi:softprob','eval_metric' = 'merror','num_class' = n_class)

nround <- 50
cv.nfold = 5

cv_model <- xgb.cv(params = params,
                   data = train_matrix, 
                   nrounds = nround,
                   nfold = cv.nfold,
                   verbose = FALSE,
                   prediction = TRUE)

OOF_prediction <- data.frame(cv_model$pred) %>% 
  mutate(max_prob = max.col(., ties.method = "last"),label = train_label + 1)
head(OOF_prediction)
##             X1           X2           X3           X4 max_prob label
## 1 2.565523e-05 2.206023e-05 9.999272e-01 2.508805e-05        3     3
## 2 2.225092e-05 9.999025e-01 2.123043e-05 5.406532e-05        2     2
## 3 2.463430e-04 2.193917e-05 9.997148e-01 1.690017e-05        3     3
## 4 3.986130e-05 9.999193e-01 2.169082e-05 1.915286e-05        2     2
## 5 2.105451e-05 9.999385e-01 2.151386e-05 1.891940e-05        2     2
## 6 2.041578e-05 2.406958e-05 2.086117e-05 9.999347e-01        4     4
confusionMatrix(factor(OOF_prediction$max_prob),
                factor(OOF_prediction$label),
                mode = "everything")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    1    2    3    4
##          1 5239    0    0    0
##          2    0 5275    0    0
##          3    0    0 5236    0
##          4    0    0    0 5250
## 
## Overall Statistics
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9998, 1)
##     No Information Rate : 0.2512     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##  Mcnemar's Test P-Value : NA         
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3 Class: 4
## Sensitivity            1.0000   1.0000   1.0000     1.00
## Specificity            1.0000   1.0000   1.0000     1.00
## Pos Pred Value         1.0000   1.0000   1.0000     1.00
## Neg Pred Value         1.0000   1.0000   1.0000     1.00
## Precision              1.0000   1.0000   1.0000     1.00
## Recall                 1.0000   1.0000   1.0000     1.00
## F1                     1.0000   1.0000   1.0000     1.00
## Prevalence             0.2495   0.2512   0.2493     0.25
## Detection Rate         0.2495   0.2512   0.2493     0.25
## Detection Prevalence   0.2495   0.2512   0.2493     0.25
## Balanced Accuracy      1.0000   1.0000   1.0000     1.00
# Fit the final model on the full training set

bst_model <- xgb.train(params = params,
                       data = train_matrix,
                       nrounds = nround)

# Predict hold-out test set
test_pred <- predict(bst_model, newdata = test_matrix)
test_prediction <- matrix(test_pred, nrow = n_class,
                          ncol=length(test_pred)/n_class) %>%
  t() %>%
  data.frame() %>%
  mutate(label = test_label + 1,
         max_prob = max.col(., "last"))
# confusion matrix of test set
confusionMatrix(factor(test_prediction$max_prob),
                factor(test_prediction$label),
                mode = "everything")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    1    2    3    4
##          1 2261    0    0    0
##          2    0 2225    0    0
##          3    0    0 2264    0
##          4    0    0    0 2250
## 
## Overall Statistics
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9996, 1)
##     No Information Rate : 0.2516     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##  Mcnemar's Test P-Value : NA         
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3 Class: 4
## Sensitivity            1.0000   1.0000   1.0000     1.00
## Specificity            1.0000   1.0000   1.0000     1.00
## Pos Pred Value         1.0000   1.0000   1.0000     1.00
## Neg Pred Value         1.0000   1.0000   1.0000     1.00
## Precision              1.0000   1.0000   1.0000     1.00
## Recall                 1.0000   1.0000   1.0000     1.00
## F1                     1.0000   1.0000   1.0000     1.00
## Prevalence             0.2512   0.2472   0.2516     0.25
## Detection Rate         0.2512   0.2472   0.2516     0.25
## Detection Prevalence   0.2512   0.2472   0.2516     0.25
## Balanced Accuracy      1.0000   1.0000   1.0000     1.00
# get the real feature names (avoid shadowing base::names)
feat_names <- colnames(dailysport[,-1])
# compute the feature importance matrix
importance_matrix <- xgb.importance(feature_names = feat_names, model = bst_model)
gp = xgb.ggplot.importance(importance_matrix)
print(gp) 
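A note on the reshaping used for `test_pred` above: with `multi:softprob`, `predict()` returns one flat vector of length `n_obs * n_class`, ordered class-fastest. A toy vector (illustrative probabilities, not model output) shows the pattern:

```r
n_class <- 3
flat <- c(0.7, 0.2, 0.1,   # observation 1: P(class 1), P(class 2), P(class 3)
          0.1, 0.8, 0.1)   # observation 2
probs <- t(matrix(flat, nrow = n_class))  # reshape to n_obs x n_class
max.col(probs, "last")  # predicted class per observation: 1 2
```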

As we can see, the predictions match those of LDA, which supports our assumption that the classes are well separated. The importance plot above shows which variables are used most often to split the data. To check the separation visually, we plot the three most important variables.

set.seed(123)

#small dataset
ds<-ds.origin %>% subset(select=c('id','LA-xMag','T-yMag','RL-zMag'))%>%droplevels()
colnames(ds)<-c('id','LA_xMag','T_yMag','RL_zMag')

# sample 30000 points from our dataset
points<-ds[sample(nrow(ds),30000),]

# 3D scatterplot
plot_ly(points, x = ~LA_xMag, y = ~T_yMag, z = ~RL_zMag,color=~id,colors = c('green', 'purple','red','blue'),marker=list(size=4)) %>%
  add_markers() %>%
  layout(scene = list(xaxis = list(title = 'x'),
                     yaxis = list(title = 'y'),
                     zaxis = list(title = 'z')))

As we expected, the classes are well separated.